MI300A Race Condition Fix #185
base: develop
Conversation
Currently we are prone to a race condition on APUs in TRLTOG/TRGTOL in the following way:

DO J = 1, N
  VARIABLE = VALUE
  !$ACC DATA COPYIN(VARIABLE) ASYNC(1)
  !$ACC PARALLEL LOOP ASYNC(1)
  DO I = 1, M
    ! do something with VARIABLE
  END DO
  !$ACC END PARALLEL LOOP
  !$ACC END DATA
END DO

On APUs the host and device share the same memory, so the host assignment to VARIABLE in a later iteration can race with the asynchronous kernel from an earlier iteration that is still reading it. On GPUs you don't have this problem because the device-side copy created by the COPYIN is separate from the host memory that gets overwritten. Paul's fix here is to essentially make the host computation stash its results per iteration, so the asynchronous kernels never see the data change underneath them. Is that right @PaulMullowney?
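A minimal sketch of that stashing idea, assuming a hypothetical per-iteration buffer VARIABLE_STASH (the actual refactoring in TRLTOG/TRGTOL is more involved): the host finishes all of its writes before any asynchronous device work reads them.

REAL :: VARIABLE_STASH(N)        ! hypothetical stash, one slot per outer iteration

DO J = 1, N
  VARIABLE_STASH(J) = VALUE      ! all host writes complete before any kernel reads them
END DO

!$ACC DATA COPYIN(VARIABLE_STASH) ASYNC(1)
DO J = 1, N
  !$ACC PARALLEL LOOP ASYNC(1)
  DO I = 1, M
    ! do something with VARIABLE_STASH(J)
  END DO
  !$ACC END PARALLEL LOOP
END DO
!$ACC END DATA
!$ACC WAIT(1)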
This is a race condition in the code and we have been lucky so far that it never showed up - nice catch! Though, my recommendation for a fix here is to rather stay with a one-dimensional IFLDA array, something along the lines of (pseudocode): precompute a per-rank offset array ISEND_FIELD_OFFSET once, and later just use IFLDA(ISEND_FIELD_OFFSET(ISETV)+X) rather than IFLDA(X). This will save us an array that depends on the number of ranks * number of fields, which is something I think we want to avoid if possible.
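A minimal sketch of that offset-based indexing; the names NPROC and ISEND_FIELD_COUNT (fields to send to each rank) are assumptions for illustration, not taken from the PR:

! Build the per-rank offsets once on the host.
ISEND_FIELD_OFFSET(1) = 0
DO ISETV = 2, NPROC
  ISEND_FIELD_OFFSET(ISETV) = ISEND_FIELD_OFFSET(ISETV-1) + ISEND_FIELD_COUNT(ISETV-1)
END DO

! Field X destined for rank ISETV is then addressed in the single 1-D array as
!   IFLDA(ISEND_FIELD_OFFSET(ISETV) + X)
! instead of keeping a separate IFLDA copy (or extra dimension) per rank.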
Correct. Provided that the COPYIN and PARALLEL region are submitted to the same stream in every loop iteration, I don't think this can be an issue on discrete GPUs. This is because subsequent COPYINs (iteration i) will wait until the previous PARALLEL region (iteration i-1) has finished.
A very interesting suggestion from Lukas there. I'll try coding that up. It seems obvious now; I wonder why it wasn't already like that.
Is the primary advantage of this approach memory savings? From my perspective, simpler, easier-to-read (and debug) code is of high value when the resource/performance benefits are insignificant. In any event, I am about to push one more commit that refactors a bit more. In the two modules, IFLDA was being computed in multiple places depending on whether one had self communication or not. I am trying to combine these into a single initialization with a copy to device. If I were you, I would try to implement off of this newer version. Also, this new commit will be rebased off of the latest develop.
Force-pushed from 6c2a79e to ee30d43
Well, I don't want to fight for one or the other solution; I just wanted to lay out an alternative I think is worth investigating.
Not quite memory savings, but yes, I think we should avoid copies and allocations that scale with the number of ranks * number of fields. This might not be as dramatic here, so I totally agree, this is not the only argument. I really didn't want to offend anyone, and we can also do it in two steps; I really just looked at this change and thought: very valid point you bring up, but I also see an alternative :)
Finally, regarding the point that this can't be an issue on discrete GPUs: it can, at least on NVIDIA hardware, so I am surprised we didn't run into it. Since all kernels and data regions are async, the CPU loop counter can run ahead and pile up copies of the 1D IFLDA. The actual execution of the copies can happen after IFLDA was already overwritten by a later iteration.
No offense taken. My motivation is to get this branch and one other into develop sooner rather than later. I am playing a constant rebasing game because we're touching a lot of the same code. I'll let others decide upon the preferred solution here. On second thought, I think you're right about the discrete GPU case, though I haven't seen it happen either, even at scale.
…ously with device kernels on the same memory. This leads to race conditions and incorrect results. In this commit, the host code is rewritten to avoid this situation. Then the kernels can continue to be executed asynchronously.
…al big jobs 32 (truncation 639) and 64 APUs (truncation 1279) for an MI300A in both discrete and APU mode. Norms look good.
…sed against develop.
Force-pushed from ee30d43 to d56a116
Any update on getting this merged?
Yesterday I started coding up the fix suggested by Lukas. Once that's done, we can compare the two and pick our favourite to merge. Alternatively, if you'd like this PR merged urgently, I can do that, and then we can consider a PR from my branch later. By the way, @PaulMullowney, have you tried the develop branch recently on MI250x? Is it stable for you?
This PR addresses race conditions that occur when running on MI300A as an APU. Host code and device kernels using the same memory simultaneously can lead to incorrect results. The solution is to refactor and stash the results of the host computation so that the GPU kernels can continue to be run asynchronously. In the case of trltog, the refactored host code can be executed during MPI communication.
This branch has been run on MI300A (XNACK=0 (discrete GPU) and XNACK=1 (APU)) and A100 nodes.
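Purely as a hedged illustration of that overlap, a minimal self-contained sketch; the names NFLDS, ZBUF, IFLDA_STASH and the use of MPI_IALLREDUCE are assumptions for this example, not the actual trltog code. The host-side index computation runs while a non-blocking MPI call is in flight, and only afterwards would the asynchronous device kernels consume the stashed result.

PROGRAM OVERLAP_SKETCH
  USE MPI
  IMPLICIT NONE
  INTEGER, PARAMETER :: NFLDS = 8
  REAL(KIND=8) :: ZBUF(NFLDS), ZSUM(NFLDS)
  INTEGER :: IFLDA_STASH(NFLDS)
  INTEGER :: IREQ, IERR, JF

  CALL MPI_INIT(IERR)
  ZBUF = 1.0D0

  ! Start the non-blocking exchange ...
  CALL MPI_IALLREDUCE(ZBUF, ZSUM, NFLDS, MPI_REAL8, MPI_SUM, MPI_COMM_WORLD, IREQ, IERR)

  ! ... and do the (stand-in) host-side index computation while it is in flight.
  DO JF = 1, NFLDS
    IFLDA_STASH(JF) = JF
  END DO

  CALL MPI_WAIT(IREQ, MPI_STATUS_IGNORE, IERR)

  ! At this point the asynchronous device kernels would consume IFLDA_STASH (omitted).
  CALL MPI_FINALIZE(IERR)
END PROGRAM OVERLAP_SKETCH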